From the outlier function we know that ApplicantIncome and CoapplicantIncome contain outliers.
To handle this, the outlier records will be removed from the dataset.

The code above removes the outlier records.
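The removal step can be sketched with the common 1.5×IQR fence rule (an assumption — the notebook's exact cutoff may differ); the toy frame and the `remove_outliers_iqr` helper here are hypothetical:

```python
import pandas as pd

def remove_outliers_iqr(df: pd.DataFrame, cols) -> pd.DataFrame:
    """Drop rows whose values fall outside the 1.5*IQR fences."""
    mask = pd.Series(True, index=df.index)
    for col in cols:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
        mask &= df[col].between(lower, upper)
    return df[mask]

# toy data standing in for the loan dataset
df = pd.DataFrame({
    "ApplicantIncome": [4000, 5000, 4500, 81000],
    "CoapplicantIncome": [1500, 1200, 2000, 1800],
})
cleaned = remove_outliers_iqr(df, ["ApplicantIncome", "CoapplicantIncome"])
# the 81000 income row falls outside the upper fence and is dropped
```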

KNN Impute after outlier handling
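A minimal sketch of the imputation step using scikit-learn's `KNNImputer` (the toy columns and `n_neighbors` value are illustrative, not the notebook's actual settings):

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# toy frame with missing values standing in for the loan dataset
df = pd.DataFrame({
    "ApplicantIncome": [4000.0, 5000.0, np.nan, 4500.0],
    "LoanAmount": [120.0, np.nan, 130.0, 110.0],
})

# each missing value is filled from the 2 nearest complete neighbors
imputer = KNNImputer(n_neighbors=2)
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
```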

Scaling Data KNN impute after outlier handling

The table above shows the dataset scaled with MinMaxScaler.
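The scaling step can be sketched as follows (toy data; the notebook applies the same `MinMaxScaler` call to the imputed loan columns):

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({
    "ApplicantIncome": [2000, 4000, 6000],
    "CoapplicantIncome": [0, 1500, 3000],
})

# MinMaxScaler maps each column linearly onto [0, 1]
scaler = MinMaxScaler()
scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
```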

KNN Impute dataset before outlier handling

From the graph above we can see that 7 neighbors gives the best accuracy.
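One way such a graph can be produced is to hide a fraction of known values, impute them for each candidate `k`, and score the rounded predictions against the truth. This is a hypothetical sketch (random toy codes, not the loan data), not the notebook's exact procedure:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

rng = np.random.default_rng(0)
# toy categorical-coded frame standing in for the loan dataset
data = pd.DataFrame(
    rng.integers(0, 4, size=(200, 3)).astype(float),
    columns=["Dependents", "Credit_History", "Education"],
)

# hide 10% of known values so each k can be scored against the truth
mask = rng.random(data.shape) < 0.10
masked = data.mask(mask)

scores = {}
for k in range(1, 11):
    imputed = KNNImputer(n_neighbors=k).fit_transform(masked)
    preds = np.rint(imputed[mask])   # round: the hidden values are category codes
    truth = data.to_numpy()[mask]
    scores[k] = float((preds == truth).mean())

best_k = max(scores, key=scores.get)  # the k to plot/report
```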

The imputation returns continuous regression values. The next code rounds the predicted values so they match the categorical values in the dataset.
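The rounding step amounts to something like this (toy values; the column names are illustrative):

```python
import pandas as pd

# imputed categorical columns come back as continuous regression values
imputed = pd.DataFrame({
    "Credit_History": [0.98, 0.33, 1.0],
    "Dependents": [1.6, 0.2, 3.0],
})

# round back to the nearest valid category code
rounded = imputed.round().astype(int)
```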

Scaling Data KNN impute before outlier handling

Since imputing before outlier handling performs better (its average Shapiro test p-value is higher than that of imputing after outlier handling), this section only shows the scaled SMOTE sample. For the raw data, change the variable name from "oversampled" to "df_loan_imputer_before_outlier_rounded".
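The Shapiro–Wilk comparison works like this: a higher p-value means less evidence against normality. A minimal sketch on synthetic columns (the notebook instead averages p-values over the two imputed datasets):

```python
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(42)
normal_col = rng.normal(size=100)       # roughly normal sample
skewed_col = rng.exponential(size=100)  # clearly non-normal sample

# higher p-value -> less evidence against normality
p_normal = shapiro(normal_col).pvalue
p_skewed = shapiro(skewed_col).pvalue
```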

The graph above shows the distribution of the SMOTE-sampled data. There are still outliers in the data, but we will keep them when training the model, since we want to compare it to the raw model under the same data conditions.

Machine Learning Raw Model

Logistic Regression

Decision Tree Classifier

KNN Classifier

Support Vector Machine Classifier
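The four raw models above can be trained and compared with a loop like this (synthetic imbalanced data standing in for the loan features; hyperparameters are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# imbalanced toy data: class 1 ("approved") dominates, as in the loan dataset
X, y = make_classification(n_samples=600, weights=[0.3, 0.7], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "DecisionTree": DecisionTreeClassifier(random_state=0),
    "KNN": KNeighborsClassifier(n_neighbors=7),
    "SVC": SVC(random_state=0),
}

reports = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    reports[name] = classification_report(
        y_test, model.predict(X_test), output_dict=True)
```

The per-class recall in each report is what reveals the bias toward the majority ("approved") class discussed below.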

From the raw model classifiers above we know that logistic regression, KNN, and SVC tend to approve loans, because the dataset contains more approved-loan records.
The decision tree model shows a better balance in recall, but with lower accuracy, and it still scores higher on approved loans.

Since the models are not very good, this notebook will optimize the data to handle the imbalance, for example through feature engineering, to support the models.

Univariate Analysis

From the histplots above we can see that:

  • Approved loans > rejected loans
  • Married > non-married in the dataset
  • Male > female in the dataset
  • Graduate > non-graduate in the dataset
  • Non self-employed > self-employed in the dataset
  • Rural > Urban > Semiurban in the dataset
  • The dataset has more approved-loan records than rejected-loan records. This condition can bias the machine learning model, so we will optimize the data before building the model to reduce that bias.

From the graph above we know that 0 dependents is most common among non-married applicants, which makes sense since non-married applicants are less likely to have children. But the data also shows married applicants with dependents: most have 2 dependents, followed by 1, with the lowest counts at 3+ dependents.

Bivariate Analysis

The figure shows that approved loans have a higher distribution of coapplicant income than rejected loans in rural and semiurban areas. But in the urban area (left figure), for graduates, rejected loans have a higher distribution than approved loans.

From the data we know that:

  • Graduates show a higher distribution of applicant income for approved loans in semiurban areas than in urban and rural areas.

  • In rural areas, non-graduates show a higher distribution of applicant income for rejected loans than for approved loans.

  • In urban areas, non-graduates show higher applicant income for rejected loans than for approved loans.

  • Most of the data is right-skewed.

  • The graphs above show that income (applicant or coapplicant) has no significant correlation with whether the loan status is Y or N.
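The right-skew observation can be checked numerically with the sample skewness (positive skew = right tail). A sketch on a synthetic income-like column, not the actual loan data:

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(1)
# income-like, right-tailed toy sample
income = rng.lognormal(mean=8, sigma=0.6, size=500)

# positive skewness confirms the right-tailed shape seen in the plots
income_skew = float(skew(income))
```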

SMOTE optimizer sampling

From the result above, the target classes are now balanced, and this data will be used to train the models to see whether performance improves compared to the raw models.

ML model after optimizing sampling

Logistic Regression

The training data is scaled using the code above in the segment "scaling data before outlier".

Decision Tree Classifier

KNN Classifier

Support Vector Machine

Raw model machine learning:

  • avg f1-score, rejected loans = 0.64
  • avg f1-score, approved loans = 0.86
  • gap between approved-loan and rejected-loan f1-scores for the raw models: 0.22

SMOTE model machine learning:

  • avg f1-score, rejected loans = 0.71
  • avg f1-score, approved loans = 0.77
  • gap between approved-loan and rejected-loan f1-scores for the SMOTE models: 0.06

From the result above we know that the SMOTE models give more balanced predictions, with a smaller gap between the approved-loan and rejected-loan f1-scores.
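The gap figures follow directly from the averages above:

```python
# average f1-scores reported above for each model family
raw = {"rejected": 0.64, "approved": 0.86}
smote = {"rejected": 0.71, "approved": 0.77}

raw_gap = round(raw["approved"] - raw["rejected"], 2)      # 0.22
smote_gap = round(smote["approved"] - smote["rejected"], 2)  # 0.06
```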

This notebook shows that the raw models have better accuracy but weaker f1-scores, since the raw models tend to overfit toward approved loans, influenced by the larger number of approved-loan records in the data.
After using SMOTE to handle the imbalanced data, the f1-scores improve (the SMOTE models have a smaller average gap), but accuracy is lower compared to the raw models.